Abstract

A  methodology  was  desired  for  optimizing  the  Numerical  Electromagnetics  Code  (NEC) 
on  a  given  platform.  The  platform  chosen  was  the  Convex  mini-supercomputer.  The  matrix 
fill  and  factor  times  were  the  gauges  of  optimizing  for  speed.  The  software  tool  for  choosing 
where  to  optimize  was  the  profiler  that  comes  with  the  FORTRAN  compiler. 

NEC2,  NEC3,  and  NEC 4  were  evaluated.  The  test  cases  were  models  of  44,  300,  722, 
and  2286  segments.  Three  levels  of  built-in  compiler  optimizations  were  used.  Additional 
optimizations  were  sought.  The  greatest  speedup  in  runtime  came  with  the  use  of  LINPACK 
library  routines  specifically  optimized  for  the  Convex. 

INTRODUCTION 

This study gives a methodology for optimizing the NEC codes (or any method of
moments code) for a given platform, in this case, a Convex mini-supercomputer.

The Convex Computer Corporation mini-supercomputers have become very popular
because of their high power for the dollar. The model used for this study was the Convex C240,
which is commonly classified as a mini-supercomputer: its vector architecture makes it a
supercomputer, and it is smaller than a Cray, making it a mini. Its salient features are as
follows:

•  4  processors  -  50  MegaFLOPS  each 

•  Each  processor  includes  scalar  and  vector  processing  units 

•  Peak  performance  -  200  MegaFLOPS 

•  LINPACK1000  benchmark:  162  MegaFLOPS 

(Cray Y-MP): 305 MegaFLOPS

•  Whetstone  benchmark:  38  MIPS 

(Cray  Y-MP):  26  MIPS 

•  MULTIunits  benchmark:  4900 

(Cray  Y-MP):  6000 

Each  processor  has 

•  8  vector  registers  of  128  elements  each 

•  Each  element  (word)  consists  of  64  bits 

•  There  are  3  independent  functional  unit  controllers: 

•  Load  and  Store 

•  Multiply  and  Divide 

•  Add  and  Logical 


SCOPE 


NEC  runs  were  made  on  various  combinations  of  the  following  parameters: 


Codes

Name    Number of Lines    Number of Routines
NEC2          8,734                81
NEC3          9,780                99
NEC4         16,039               207

NEC2 was chosen for its complete documentation, NEC4 for being the latest release, and
NEC3 to round out the family.

Compiler  Optimizations  -  None,  Scalar,  Vector,  Parallel  - 

The FORTRAN compiler provides three types of automatic optimization.
Scalar optimization applies many machine-dependent and machine-independent
optimizations at the scalar level. Vector optimization looks for loops that
operate on arrays and converts as much of each loop as possible to vector
operations. Parallel optimization operates only within individual routines; it spreads
the processing among the four processors where that would be more efficient.

In  the  following  discussion,  the  optimizations  are  labeled  as  follows. 

Optimization Types

0  None
1  Scalar
2  Scalar + Vector
3  Scalar + Vector + Parallel

Models  -  44  segment,  300  segment,  722  segment,  2286  segment  - 

The  44  segment  model  is  a  one  wavelength  loop.  The  300  segment  model  is  a  monopole 
on  a  ground  plane.  The  722  segment  model  is  the  US  Navy’s  Spruance  (DD-963)  class 
destroyer  segmented  for  up  to  6  MHz  problems.  The  2286  segment  model  is  the  same  ship 
segmented  for  30  MHz  problems. 

Information  Gathered 

For  each  run,  the  following  information  was  gathered: 

•  Matrix  fill,  matrix  factor,  and  total  run-time 

•  A  profile  of  the  run  listing  each  routine  and  operation  used  and  for  each: 

•  Percentage  of  total  run-time  used  in  calls  to  the  routine  or  operation 

•  The  number  of  calls  to  the  routine  or  operation 

•  The  time  used  in  a  call  to  the  routine  or  operation 

For  some  of  the  runs,  some  additional  information  was  gathered:  the  percent  used  of  all 
the  processors,  the  amount  of  memory  used,  the  physical  reads  and  writes,  the  number  of  page 
faults,  and  the  number  of  page  faults  paged  out  to  disk. 


Manual  Optimization 


Based on the profiles of the runs, routines were chosen for optimization beyond the
automatic optimizations of the quite intelligent compiler.


RESULTS 


Verification 

The impedance of an antenna on each of the models was used to verify that a run was
valid.

Profiles 

To  gauge  the  performance  of  each  run,  the  profiler  that  comes  with  the  FORTRAN 
compiler  was  used.  There  is  some  overhead  in  its  use  as  it  performs  its  counts  and  timings  as 
seen  in  the  examples  below  for  a  722  segment  model. 


                                  Fill       Factor     Total
NEC3, optim. 2, w/o profiler      106.943    37.999     148.343
                with profiler     120.596    32.316     157.349
NEC4, optim. 3, w/o profiler      128.441    49.368     187.582
                with profiler     155.841    45.500     213.442

All  times  in  the  following  data  and  discussion  presume  the  use  of  the  profiler.  You  will  see  in 
the  following  profiles  an  item  called  "mcount".  This  is  one  of  the  profiler  overhead  items. 
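The size of the profiler overhead can be worked out from the four total times listed above. The following is a minimal sketch of that arithmetic; the function and variable names are illustrative and do not come from the study.

```python
# Total run times in seconds for the 722 segment model, copied from the
# comparison above (without and with the profiler).
runs = {
    "NEC3, optim. 2": {"without": 148.343, "with": 157.349},
    "NEC4, optim. 3": {"without": 187.582, "with": 213.442},
}


def profiler_overhead(without, with_profiler):
    """Return the extra total run time caused by the profiler, in percent."""
    return 100.0 * (with_profiler - without) / without


overhead = {
    name: profiler_overhead(t["without"], t["with"]) for name, t in runs.items()
}

for name, pct in overhead.items():
    print(f"{name}: total run time is {pct:.1f}% longer with the profiler")
```

On these figures the profiler adds roughly 6% to the NEC3 total and roughly 14% to the NEC4 total.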

The following series of profiles shows the differences between NEC2, NEC3, and NEC4
for a 722 segment model using compiler optimization 2 in all cases. The routines or functions
that take up more than 5% of the total runtime are shown.


NEC2        Fill       Factor     Total
            177.541    32.436     213.422

%time    cumsecs    #call      ms/call     name
15.7       34.84     521284       0.07     efld
15.4       69.07    1021498       0.03     eksc
15.2      102.82          1   33750.00     factr
14.7      135.38    4901342       0.01     gf
 9.8      157.13                21750ms    mcount
 9.1      177.24    1021498       0.02     intx
 7.5      194.00    2042996       0.01     gx
 6.5      208.49        722      20.07     __cmww

NEC3        Fill       Factor     Total
            120.596    32.316     157.349

%time    cumsecs    #call      ms/call     name
20.5       33.62          1   33620.00     factr
17.1       61.58     887915       0.03     eksclr
15.7       87.28     521284       0.05     efld
11.5      106.19                18910ms    mcount
10.6      123.57        722      24.07     cmww
 5.5      132.50    1351794       0.01     _mth$c_exp

NEC4        Fill       Factor     Total
            146.313    32.366     190.199

%time    cumsecs    #call      ms/call     name
17.1       33.82        286     118.25     _factr
14.9       63.24     887913       0.03     _ekscl
13.5       89.95                26710ms    mcount
10.8      111.23    1042568       0.02     _eflds
 8.6      128.22        722      23.53     cmww
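The relationship between the columns of these listings, and the cutoff used to select routines, can be sketched as follows. This is an illustration only: the data are the NEC2 rows above with routine names normalized, and the helper names are hypothetical rather than part of the profiler.

```python
# (routine, %time, cumulative seconds) rows from the NEC2 listing.
nec2_profile = [
    ("efld",   15.7,  34.84),
    ("eksc",   15.4,  69.07),
    ("factr",  15.2, 102.82),
    ("gf",     14.7, 135.38),
    ("mcount",  9.8, 157.13),
    ("intx",    9.1, 177.24),
    ("gx",      7.5, 194.00),
    ("cmww",    6.5, 208.49),
]


def self_seconds(profile):
    """Recover each routine's own time by differencing the cumulative column."""
    times = {}
    previous = 0.0
    for name, _pct, cumsecs in profile:
        times[name] = round(cumsecs - previous, 2)
        previous = cumsecs
    return times


def hot_routines(profile, threshold_pct=5.0):
    """Return the routines whose share of the run time exceeds the threshold."""
    return [name for name, pct, _cum in profile if pct > threshold_pct]


secs = self_seconds(nec2_profile)
print("factr:", secs["factr"], "s")    # a single call of 33750.00 ms
print("mcount:", secs["mcount"], "s")  # the 21750ms profiler overhead entry
print(hot_routines(nec2_profile, threshold_pct=10.0))
```

Differencing the cumulative column reproduces the per-routine times, which is a quick way to check a listing for transcription errors.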

The  following  show  the  differences  in  profiles  as  the  size  of  the  model  changes.  (NEC4 
is  used  with  optimization  3). 


The last profile was of a run using LINPACK routines, discussed next.


Manual  Optimization 

LINPACK, a widely available set of routines for solving linear equations, was
available in a version specifically optimized for the Convex hardware. Two routines were chosen to replace
the matrix factor and solve portions of the NEC codes.

Function         NEC routine    LINPACK routine
Factor matrix    factr          cgefa
Solve matrix     solve          cgesl

In both cases, because of the way NEC stores matrices, the interaction matrix had to be
transposed before and after the LINPACK routines were used.
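The transpose, factor, solve, transpose sequence can be illustrated with a small pure-Python sketch. It is not the LINPACK code and does not use the LINPACK calling sequence: `lu_factor` and `lu_solve` are hypothetical stand-ins for the factor and solve steps, and the example simply assumes the caller holds the transpose of the matrix the solver expects.

```python
def transpose_in_place(a):
    """Swap a[i][j] with a[j][i] for a square matrix stored as a list of rows."""
    n = len(a)
    for i in range(n):
        for j in range(i + 1, n):
            a[i][j], a[j][i] = a[j][i], a[i][j]


def lu_factor(a):
    """Gaussian elimination with partial pivoting; overwrites a with L and U."""
    n = len(a)
    pivots = []
    for k in range(n):
        # Choose the row with the largest magnitude entry in column k.
        p = max(range(k, n), key=lambda r: abs(a[r][k]))
        if a[p][k] == 0:
            raise ZeroDivisionError("matrix is singular")
        pivots.append(p)
        a[k], a[p] = a[p], a[k]
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]  # multiplier, kept in the L part
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return pivots


def lu_solve(a, pivots, b):
    """Solve A x = b from the factors and pivots returned by lu_factor."""
    n = len(a)
    x = list(b)
    for k, p in enumerate(pivots):  # apply the recorded row swaps
        x[k], x[p] = x[p], x[k]
    for i in range(n):  # forward substitution with unit lower triangle
        x[i] -= sum(a[i][j] * x[j] for j in range(i))
    for i in reversed(range(n)):  # back substitution with upper triangle
        x[i] = (x[i] - sum(a[i][j] * x[j] for j in range(i + 1, n))) / a[i][i]
    return x


# A small non-symmetric complex system with a known solution.
a_true = [[2 + 1j, 1 + 0j, 0j], [4 + 0j, 3 - 1j, 1j], [0j, 2j, 4 + 0j]]
x_true = [1 + 0j, 1j, -1 + 0j]
b = [sum(row[j] * x_true[j] for j in range(3)) for row in a_true]

# Assumed situation: the caller stores the matrix transposed.
stored = [[a_true[j][i] for j in range(3)] for i in range(3)]

transpose_in_place(stored)        # convert to the layout the solver expects
pivots = lu_factor(stored)
x = lu_solve(stored, pivots, b)
transpose_in_place(stored)        # hand the factors back in the caller's layout

error = max(abs(x[i] - x_true[i]) for i in range(3))
print("largest error in the solution:", error)
```

The two transpositions cost on the order of N squared operations each, which is small next to the order N cubed cost of the factorization for large models.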

Next, routines high in the profile list were examined for changes that would let the
compiler vectorize them. In NEC2, efld, eksc, gf, intx, and rest had no
loops. In NEC4, eksclr, efldsg, and ekscsz had no loops. In both, cmww was already
70% vectorized automatically by the compiler. All other routines consumed less than 5% of the
total runtime, so it was not considered worthwhile to continue the optimization effort.
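The reasoning behind stopping at the 5% level can be made concrete with Amdahl's law, which is not cited in the study but is the standard way to bound such gains. The sketch below, with an illustrative function name, shows the most that could be gained from a routine consuming 5% of the runtime.

```python
def overall_speedup(fraction, routine_speedup):
    """Amdahl's law: whole-program speedup when one part of it runs faster.

    fraction        -- share of the total runtime spent in the routine (0 to 1)
    routine_speedup -- factor by which that routine alone is accelerated
    """
    return 1.0 / ((1.0 - fraction) + fraction / routine_speedup)


# A routine using 5% of the runtime, made twice as fast.
doubled = overall_speedup(0.05, 2.0)

# The same routine made infinitely fast, the best possible case.
best_case = overall_speedup(0.05, float("inf"))

print(f"routine twice as fast: overall speedup {doubled:.3f}")
print(f"routine eliminated:    overall speedup {best_case:.3f}")
```

Even eliminating such a routine entirely would shorten the run by only about 5%.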


Runtimes 

The  following  lists  the  impedances  and  runtimes  of  the  significant  runs. 


                                  Impedance             Times (seconds)
Segments  NEC  Optim  LinPack      R        X         FILL     FACTOR      TOTAL
   44      2     2       -       100.6   -139.1       0.36       0.02       0.47
   44      4     3       -       100.3   -140.6       0.47       0.05       0.69
   44      4     3      yes      100.3   -140.6       0.47       0.01       0.66
  300      2     0       -       232.3    -36.4       9.54      19.28      29.86
  300      2     1       -       232.3    -36.4       9.20      12.97      22.98
  300      2     2       -       232.3    -36.4      10.23       2.36      13.30
  300      2     3       -       232.3    -36.4      10.63       4.39      15.80
  300      4     3       -       240.3    -37.1      23.87       4.44      30.29
  300      4     3      yes      240.3    -37.1      24.12       1.71      27.86
  722      2     0       -        20.3     -0.5     168.27     268.14     441.56
  722      2     1       -        20.3     -0.5     176.80     180.83     361.73
  722      2     2       -        20.3     -0.4     177.54      32.44     213.42
  722      2     2      yes       20.3     -0.4     176.06      22.08     202.21
  722      2     3       -        20.3     -0.4     179.58      45.46     228.81
  722      3     2       -        19.7     -2       120.60      32.32     157.35
  722      4     2       -        23        0.5     146.31      32.37     190.20
  722      4     3       -        23        0.5     155.84      45.50     213.44
  722      4     3      yes       23        0.5     157.37      22.18     193.51
 2286      4     3       -        34       20.8    1437.45    5678.13    7220.75
 2286      4     3      yes       34       20.8    1451.86     700.77    2304.25

In matrix format

FILL RUNTIMES (sec)

                              Optimization
Segments  NEC      0       1       2    2 LinP       3    3 LinP
   44      2       -       -    0.36       -         -       -
   44      4       -       -       -       -      0.47    0.47
  300      2      10       9      10       -        11       -
  300      4       -       -       -       -        24      24
  722      2     168     177     178     176       180       -
  722      3       -       -     121       -         -       -
  722      4       -       -     146       -       156     157
 2286      4       -       -       -       -      1437    1452


FACTOR RUNTIMES (sec)

                              Optimization
Segments  NEC      0       1       2    2 LinP       3    3 LinP
   44      2       -       -    0.02       -         -       -
   44      4       -       -       -       -      0.05    0.01
  300      2      19      13       2       -         4       -
  300      4       -       -       -       -         4       2
  722      2     268     181      32      22        45       -
  722      3       -       -      32       -         -       -
  722      4       -       -      32       -        46      22
 2286      4       -       -       -       -      5678     701

TOTAL RUNTIMES (sec)

                              Optimization
Segments  NEC      0       1       2    2 LinP       3    3 LinP
   44      2       -       -    0.47       -         -       -
   44      4       -       -       -       -      0.69    0.66
  300      2      30      23      13       -        16       -
  300      4       -       -       -       -        30      28
  722      2     442     362     213     202       229       -
  722      3       -       -     157       -         -       -
  722      4       -       -     190       -       213     194
 2286      4       -       -       -       -      7221    2304

Figures  1,  2,  3,  4,  and  5  graphically  show  the  trends  in  the  data. 
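As a rough cross-check of the trends, the NEC4 optimization 3 times for the two ship models can be compared with the growth usually assumed for a method of moments code, where fill work scales as the square of the segment count and factor work as the cube. That cost model is an assumption added here, not a statement from the study, and the variable names are illustrative.

```python
# NEC4, optimization 3 times in seconds, copied from the runtime listing.
small_n, large_n = 722, 2286
fill = {"small": 155.84, "large": 1437.45}
factor_nec = {"small": 45.50, "large": 5678.13}      # NEC's own factor routine
factor_linpack = {"small": 22.18, "large": 700.77}   # LINPACK replacement

size_ratio = large_n / small_n
expected_fill_growth = size_ratio ** 2     # assumed N^2 growth of fill work
expected_factor_growth = size_ratio ** 3   # assumed N^3 growth of factor work

fill_growth = fill["large"] / fill["small"]
nec_factor_growth = factor_nec["large"] / factor_nec["small"]
linpack_factor_growth = factor_linpack["large"] / factor_linpack["small"]

print(f"fill:             measured x{fill_growth:.1f}, N^2 gives x{expected_fill_growth:.1f}")
print(f"factor (NEC):     measured x{nec_factor_growth:.1f}, N^3 gives x{expected_factor_growth:.1f}")
print(f"factor (LINPACK): measured x{linpack_factor_growth:.1f}, N^3 gives x{expected_factor_growth:.1f}")
```

Under that assumption the fill time and the LINPACK factor time grow about as expected, while the original factor routine grows several times faster than the cube of the segment count.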


CONCLUSIONS 

Just  as  important  as  the  speed  of  the  machine  is  the  way  the  software  utilizes  its 
resources.  For  the  case  of  the  Convex  computer  used  in  this  study,  the  automatic  optimizations 
supplied  with  its  FORTRAN  compiler  cut  the  time  for  a  NEC  run  in  half  for  a  722  segment 
model. 

Library  routines  optimized  for  a  machine’s  hardware  should  be  used  whenever  possible 
to  replace  existing  code.  Two  routines  especially  appropriate  for  a  method  of  moments  code  are 
the  matrix  factor  and  matrix  solve  routines  from  the  LINPACK  library.  These  had  been 
optimized  for  the  Convex  hardware.  For  a  2,286  segment  model,  they  cut  the  factor  time  by 
a  factor  of  8  resulting  in  an  overall  runtime  improvement  of  a  factor  of  3. 
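The speedup factors quoted in these conclusions follow directly from the measured times. A minimal sketch of the arithmetic, using illustrative names and the values from the runtime listing:

```python
def speedup(before, after):
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before / after


# 722 segments, NEC2: total time with no optimization versus optimization 2.
compiler_total = speedup(441.56, 213.42)

# 2286 segments, NEC4, optimization 3: without versus with LINPACK routines.
linpack_factor = speedup(5678.13, 700.77)
linpack_total = speedup(7220.75, 2304.25)

print(f"compiler optimizations, 722 segments: total x{compiler_total:.2f}")
print(f"LINPACK, 2286 segments: factor x{linpack_factor:.2f}, total x{linpack_total:.2f}")
```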